
    Analysis of metabolomic data: tools, current strategies and future challenges for omics data integration

    Metabolomics is a rapidly growing field consisting of the analysis of a large number of metabolites at a system scale. The two major goals of metabolomics are the identification of the metabolites characterizing each organism state and the measurement of their dynamics under different situations (e.g. pathological conditions, environmental factors). Knowledge about metabolites is crucial for the understanding of most cellular phenomena, but this information alone is not sufficient to gain a comprehensive view of all the biological processes involved. Integrated approaches combining metabolomics with transcriptomics and proteomics are thus required to obtain much deeper insights than any of these techniques alone. Although this information is available, multilevel integration of different 'omics' data is still a challenge. The handling, processing, analysis and integration of these data require specialized mathematical, statistical and bioinformatics tools, and several technical problems hampering rapid progress in the field remain. Here, we review four (MetaCore(TM), MetaboAnalyst, InCroMAP and 3Omics) of the several tools available for metabolomic data analysis and integration with other 'omics' data, selected for their number of users or provided features, highlighting their strong and weak aspects; a number of related issues affecting data analysis and integration are also identified and discussed. Overall, we provide an objective description of how some of the main currently available software packages work, which may help the experimental practitioner in the choice of a robust pipeline for metabolomic data analysis and integration.

    Supervised Relevance-Redundancy assessments for feature selection in omics-based classification scenarios

    Background and objective: Many classification tasks in translational bioinformatics and genomics are characterized by the high dimensionality of potential features and unbalanced sample distribution among classes. This can affect classifier robustness and increase the risk of overfitting, the curse of dimensionality and generalization leaks; furthermore, and most importantly, it can prevent obtaining the adequate patient stratification required for precision medicine in facing complex diseases, like cancer. Setting up a feature selection strategy able to extract only properly predictive features by removing irrelevant, redundant, and noisy ones is crucial to achieving valuable results on the desired task. Methods: We propose a new feature selection approach, called ReRa, based on supervised Relevance-Redundancy assessments. ReRa consists of a customized step of relevance-based filtering, to identify a reduced subset of meaningful features, followed by a supervised similarity-based procedure to minimize redundancy. This latter step innovatively uses a combination of global and class-specific similarity assessments to remove redundant features while preserving those differentiated across classes, even when these classes are strongly unbalanced. Results: We compared ReRa with several existing feature selection methods to obtain feature spaces on which to perform breast cancer patient subtyping using several classifiers, considering two use cases based on gene or transcript isoform expression. In the vast majority of the assessed scenarios, performance on ReRa-selected feature spaces was significantly higher than with simple feature filtering, LASSO regularization, or even mRMR, another relevance-redundancy method. The two use cases represent an insightful example of translational application, taking advantage of ReRa's capabilities to investigate and enhance a clinically relevant patient stratification task, which could easily be applied to other cancer types and diseases as well. Conclusions: The ReRa approach has the potential to improve the performance of machine learning models used in unbalanced classification scenarios. Compared to another relevance-redundancy approach like mRMR, ReRa does not require tuning the number of preserved features, ensures efficiency and scalability over huge initial dimensionalities, and allows re-evaluation of all previously selected features at each iteration of the redundancy assessment, to ultimately preserve only the most relevant and class-differentiated features.
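The abstract does not specify ReRa's actual scoring functions; as a rough illustration of the generic relevance-redundancy pattern it builds on (not the published algorithm; the relevance score and correlation threshold below are hypothetical choices), a minimal NumPy sketch:

```python
import numpy as np

def relevance_redundancy_select(X, y, n_relevant=50, corr_thresh=0.9):
    """Toy two-step selector: relevance filtering, then greedy
    redundancy removal. Stand-in scores, not ReRa's own."""
    classes = np.unique(y)
    # Relevance: spread of class means scaled by a pooled std
    # (a simple stand-in for ReRa's relevance-based filtering).
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    relevance = (means.max(axis=0) - means.min(axis=0)) / (X.std(axis=0) + 1e-12)
    top = np.argsort(relevance)[::-1][:n_relevant]

    # Redundancy: greedily keep a feature unless it is highly correlated
    # (globally) with an already-kept one. ReRa additionally combines
    # class-specific similarity assessments, omitted here for brevity.
    kept = []
    for j in top:
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < corr_thresh
               for k in kept):
            kept.append(j)
    return kept
```

On synthetic data where one informative feature is duplicated, exactly one of the two copies survives the redundancy step while uncorrelated features pass through.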

    RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor

    Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for the tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still mainly address primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures.

    Conceptual modeling for genomics: Building an integrated repository of open data

    Many repositories of open data for genomics, collected by world-wide consortia, are important enablers of biological research; moreover, all experimental datasets leading to publications in genomics must be deposited in public repositories and made available to the research community. These datasets are typically used by biologists for validating or enriching their experiments; their content is documented by metadata. However, the emphasis on data sharing is not matched by accuracy in data documentation; metadata are not standardized across the sources and are often unstructured and incomplete. In this paper, we propose a conceptual model of genomic metadata, whose purpose is to query the underlying data sources for locating relevant experimental datasets. First, we analyze the most typical metadata attributes of genomic sources and define their semantic properties. Then, we use a top-down method for building a global-as-view integrated schema, by abstracting the most important conceptual properties of genomic sources. Finally, we describe the validation of the conceptual model by mapping it to three well-known data sources: TCGA, ENCODE, and Gene Expression Omnibus.

    BITS 2015: The annual meeting of the Italian Society of Bioinformatics

    This preface introduces the content of the BioMed Central journal Supplements related to the BITS 2015 meeting, held in Milan, Italy, from the 3rd to the 5th of June, 2015.

    Genomic data integration and user-defined sample-set extraction for population variant analysis

    Population variant analysis is of great importance for gathering insights into the links between human genotype and phenotype. The 1000 Genomes Project established a valuable reference for human genetic variation; however, the integrative use of the corresponding data with other datasets within existing repositories and pipelines is not fully supported. In particular, there is a pressing need for flexible and fast selection of population partitions based on their variant and metadata-related characteristics.

    Explorative search of distributed bio-data to answer complex biomedical questions

    Background: The huge amount of biomedical-molecular data increasingly produced is providing scientists with potentially valuable information. Yet, such data quantity makes it difficult to find and extract the data that are most reliable and most related to the biomedical questions to be answered, which are increasingly complex and often involve many different biomedical-molecular aspects. Such questions can be addressed only by comprehensively searching and exploring different types of data, which are frequently ordered and provided by different data sources. Search Computing has been proposed for the management and integration of ranked results from heterogeneous search services. Here, we present its novel application to the explorative search of distributed biomedical-molecular data and the integration of the search results to answer complex biomedical questions. Results: A set of available bioinformatics search services has been modelled and registered in the Search Computing framework, and a Bioinformatics Search Computing application (Bio-SeCo) using such services has been created and made publicly available at http://www.bioinformatics.deib.polimi.it/bio-seco/seco/. It offers an integrated environment which eases search, exploration and ranking-aware combination of heterogeneous data provided by the available registered services, and supplies global results that can support answering complex multi-topic biomedical questions. Conclusions: By using Bio-SeCo, scientists can explore the very large and very heterogeneous biomedical-molecular data available. They can easily make different explorative search attempts, inspect the obtained results, select the most appropriate, expand or refine them, and move forward and backward in the construction of a global complex biomedical query on multiple distributed sources that could eventually find the most relevant results. Thus, Bio-SeCo provides extremely useful automated support for exploratory integrated bio-search, which is fundamental for Life Science data-driven knowledge discovery.
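The abstract does not detail how ranked results from different services are combined; one common rank-aware join, sketched below purely as an illustration (function name, normalization, and weighting are assumptions, not Search Computing's actual algorithm), scores each item appearing in both ranked lists by a weighted sum of its normalized ranks:

```python
def rank_join(ranked_a, ranked_b, weight_a=0.5):
    """Toy rank-aware join of two ranked lists of items.
    Items present in both lists are scored by a weighted
    combination of their normalized positions."""
    def norm(ranked):
        n = len(ranked)
        # Top item gets score 1.0, decreasing linearly with rank.
        return {item: 1.0 - i / n for i, item in enumerate(ranked)}
    sa, sb = norm(ranked_a), norm(ranked_b)
    common = sa.keys() & sb.keys()
    scored = {k: weight_a * sa[k] + (1 - weight_a) * sb[k] for k in common}
    return sorted(scored, key=scored.get, reverse=True)
```

For example, joining gene lists returned by two hypothetical services keeps only the shared genes, ordered by the combined score.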

    MicroGen: a MIAME compliant web system for microarray experiment information and workflow management

    BACKGROUND: Improvements in bio-nano-technologies and biomolecular techniques have led to the increasing production of high-throughput experimental data. Spotted cDNA microarray is one of the most widespread technologies, used in single research laboratories and in biotechnology service facilities. Although routinely performed, spotted microarray experiments are complex procedures entailing several experimental steps and actors with different technical skills and roles. During an experiment, the actors involved, who may also be located at a distance from each other, need to access and share specific experiment information according to their roles. Furthermore, complete information describing all experimental steps must be orderly collected to allow subsequent correct interpretation of the experimental results. RESULTS: We developed MicroGen, a web system for managing information and workflow in the production pipeline of spotted microarray experiments. It consists of a core multi-database system able to store all the data completely characterizing different spotted microarray experiments according to the Minimum Information About Microarray Experiments (MIAME) standard, and of an intuitive and user-friendly web interface able to support the collaborative work required among the multidisciplinary actors and roles involved in spotted microarray experiment production. MicroGen supports six types of user roles: the researcher who designs and requests the experiment, the spotting operator, the hybridisation operator, the image processing operator, the system administrator, and the generic public user who can access the unrestricted part of the system to get information about MicroGen services. CONCLUSION: MicroGen represents a MIAME compliant information system that enables managing workflow and supporting collaborative work in spotted microarray experiment production.

    Statistical analysis of genomic protein family and domain controlled annotations for functional investigation of classified gene lists

    Background: The increasing protein family and domain based annotations constitute important information to understand protein functions and gain insight into relations among their codifying genes. To enable the analysis of gene proteomic annotations, we implemented novel modules within GFINDer, a Web system we previously developed that dynamically aggregates functional and phenotypic annotations of user-uploaded gene lists and allows performing their statistical analysis and mining. Results: Exploiting protein information in the Pfam and InterPro databanks, we developed and added to GFINDer original modules specifically devoted to the exploration and analysis of functional signatures of gene protein products. They allow annotating numerous user-classified nucleotide sequence identifiers with controlled information on related protein families, domains and functional sites, classifying them according to such protein annotation categories, and statistically analyzing the obtained classifications. In particular, when the uploaded nucleotide sequence identifiers are subdivided into classes, the Statistics Protein Families&Domains module allows estimating the relevance of Pfam or InterPro controlled annotations for the uploaded genes by highlighting protein signatures significantly more represented within user-defined classes of genes. In addition, the Logistic Regression module allows identifying protein functional signatures that better explain the considered gene classification. Conclusion: The novel GFINDer modules provide genomic protein family and domain analyses supporting better functional interpretation of gene classes, for instance defined through statistical and clustering analyses of gene expression results from microarray experiments. They can hence help in understanding fundamental biological processes and complex cellular mechanisms influenced by protein domain composition, and contribute to unveiling new biomedical knowledge about the codifying genes.
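Over-representation of an annotation category within a gene class, as described above, is typically assessed with a one-sided hypergeometric (Fisher exact) test; the abstract does not state which statistic GFINDer uses, so the following stdlib-only sketch is a generic illustration of the approach rather than the system's actual code:

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """One-sided p-value that a gene class of size n contains k or
    more genes carrying a given annotation (e.g. a Pfam domain),
    drawn from N total genes of which K carry the annotation."""
    # Sum hypergeometric probabilities for k, k+1, ..., min(n, K).
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)
```

A small p-value indicates the annotation is significantly over-represented in the class relative to the whole uploaded gene list.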